4. MySQL tables

This section describes the MySQL tables used by locust and their columns. The table design was borrowed from ASPseek.

wordurl

This table keeps the database dictionary.

word
Word itself, in Unicode.
word_id
Numerical ID of the word.

urlword

This table keeps information about all encountered URLs that match the conditions specified in the configuration files, whether already indexed or not yet indexed.

url_id
ID of URL.
site_id
ID of site, refers to sites.site_id.
deleted
Set to 1 if server returned an error.
url
URL itself.
next_index_time
Time of next indexing in seconds since the UNIX epoch.
status
HTTP status returned by server or 0 if document has not been indexed yet.
crc
MD5 checksum of document.
last_modified
"Last-Modified" field in the HTTP header.
etag
"ETag" field in the HTTP header.
last_index_time
Time of last indexing in seconds since the UNIX epoch.
referrer
ID of the URL that first referred to this URL.
tag
Arbitrary tag.
hops
Depth of URL in hyperlink tree.
redir
ID of the URL to which this URL is redirected, or 0 if this URL is not redirected.
origin
Set to 0 for the original, 1 for a clone.
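
To illustrate how redir links rows together, the following sketch follows a chain of redirects to the final URL ID. It assumes the url_id and redir values have already been read from urlword into a dictionary. The function name resolve_redirect and the loop check are hypothetical and not part of locust.

```python
from typing import Dict


def resolve_redirect(url_id: int, redir_by_id: Dict[int, int]) -> int:
    """Follow redir values until a URL with redir == 0 is reached.

    redir_by_id maps url_id -> redir, as read from the urlword table
    (assumption: the caller has loaded these rows beforehand).
    """
    seen = set()
    current = url_id
    while True:
        if current in seen:
            # Guard against rows that redirect to each other.
            raise ValueError(f"redirect loop detected at url_id {current}")
        seen.add(current)
        target = redir_by_id.get(current, 0)
        if target == 0:
            # 0 means "not redirected", so this is the final URL.
            return current
        current = target
```

For example, with redir values {1: 2, 2: 3, 3: 0}, resolving URL 1 yields 3.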

urlwordsNN (where NN is a 2-digit number from 00 to 15)

These tables contain additional information about indexed URLs. The number NN in the table name is url_id mod 16.
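
The table name for a given URL can be derived as in the sketch below. The function name urlwords_table is hypothetical. The two-digit zero padding follows the "00-15" range given above.

```python
def urlwords_table(url_id: int, tables: int = 16) -> str:
    """Return the name of the urlwordsNN table that holds url_id."""
    if url_id < 0:
        raise ValueError("url_id must be non-negative")
    # NN is url_id mod 16, written as two decimal digits.
    return f"urlwords{url_id % tables:02d}"
```

For example, url_id 31 maps to urlwords15 and url_id 16 maps to urlwords00.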

deleted
Set to 1 if server returned an error.
wordcount
Count of unique words in the indexed part of URL.
totalcount
Total count of words in the indexed part of URL.
content_type
Content-Type HTTP header returned by server.
charset
Document charset taken from Content-Type HTTP header or META.
title
First 128 characters from page title.
txt
First 255 characters from page body, stripped of HTML tags.
docsize
Total size of the document at this URL.
keywords
First 255 characters from page keywords.
description
First 100 characters from page description.
lang
Not used now.
words
Zipped content of URL.
hrefs
Sorted array of outgoing href IDs from this URL.

The first 4 bytes (the size of an unsigned) of the blob field words hold the size of the uncompressed document content, or the value 0xFFFFFFFF if the content is stored uncompressed (this can happen if the compressed content is longer than the uncompressed one, or if compression fails). The rest of the blob holds the compressed or uncompressed content.
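
A minimal sketch of reading and writing this layout follows. It rests on two assumptions not stated above: that "zipped" means the zlib format, and that the 4-byte size is stored little-endian (the text only says "size of unsigned", which suggests the byte order of the indexing host). The function names are hypothetical.

```python
import struct
import zlib

UNCOMPRESSED_MARKER = 0xFFFFFFFF  # value given in the description above


def decode_words(blob: bytes) -> bytes:
    """Return the document content stored in a words blob."""
    if len(blob) < 4:
        raise ValueError("blob is shorter than the 4-byte size prefix")
    # Assumption: little-endian unsigned 32-bit size prefix.
    (size,) = struct.unpack_from("<I", blob, 0)
    payload = blob[4:]
    if size == UNCOMPRESSED_MARKER:
        return payload  # content was stored as-is
    content = zlib.decompress(payload)  # assumption: zlib format
    if len(content) != size:
        raise ValueError("size prefix does not match decompressed length")
    return content


def encode_words(content: bytes) -> bytes:
    """Build a blob in the same layout, for illustration only."""
    try:
        packed = zlib.compress(content)
    except zlib.error:
        packed = None  # compression failed
    if packed is None or len(packed) > len(content):
        # Store uncompressed and flag it with the marker value.
        return struct.pack("<I", UNCOMPRESSED_MARKER) + content
    return struct.pack("<I", len(content)) + packed
```

Very short content usually grows when compressed, so encode_words stores it uncompressed behind the 0xFFFFFFFF marker, and decode_words returns it unchanged.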

robots

This table contains information parsed from the robots.txt file of each site.

hostinfo
Host name.
path
Path to exclude from indexing.
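
As an illustration, the sketch below checks a URL path against rows of this table. It assumes that an excluded path applies to every URL path beginning with it, as is conventional for robots.txt Disallow rules, and that hostinfo is compared as an exact string. The function name is_excluded is hypothetical, and locust's own matching may differ.

```python
from typing import Iterable, Tuple


def is_excluded(host: str, url_path: str,
                rows: Iterable[Tuple[str, str]]) -> bool:
    """Return True if url_path on host is covered by a robots row.

    rows holds (hostinfo, path) pairs as read from the robots table.
    """
    for hostinfo, path in rows:
        # Assumption: prefix match; empty paths exclude nothing.
        if hostinfo == host and path and url_path.startswith(path):
            return True
    return False
```

With the rows ("www.my.com", "/cgi-bin/") and ("www.my.com", "/tmp/"), the path /cgi-bin/search on www.my.com is excluded while /index.html is not.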

sites

This table contains IDs for all indexed sites.

site_id
ID of site.
site
Site name with protocol, like http://www.my.com/.
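
The sketch below shows one way to derive a site name of this form from a full URL, using Python's standard urllib.parse. The function name site_of is hypothetical. Whether locust applies further normalization (for example, lowercasing the host name) is not stated here, so the host is left as given.

```python
from urllib.parse import urlsplit


def site_of(url: str) -> str:
    """Return the 'protocol://host/' part of url."""
    parts = urlsplit(url)
    if not parts.scheme or not parts.netloc:
        raise ValueError(f"not an absolute URL: {url!r}")
    # netloc includes the port, if one was given in the URL.
    return f"{parts.scheme}://{parts.netloc}/"
```

For example, http://www.my.com/docs/index.html?x=1 yields http://www.my.com/.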

stat

This table contains statistics for each completed query.

addr
IP address of the computer from which the query was requested.
proxy
IP address of the proxy server through which the query was requested.
query
Query string.
ul
URL limit used to restrict the query.
sp
Web spaces used to restrict the query.
site
Site ID used to restrict the query.
np
Results page number requested.
ps
Results per page.
sites
Number of found sites matching query.
urls
Number of found URLs matching query.
start
Query processing start in seconds since the UNIX epoch.
finish
Query processing finish in seconds since the UNIX epoch.
referer
URL of web page from which query was requested.
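
Since start and finish are both expressed in seconds, their difference gives the processing time of a query. The sketch below averages that time per query string. It assumes the query, start and finish columns have already been fetched as tuples; the function name mean_query_time is hypothetical.

```python
from collections import defaultdict
from typing import Dict, Iterable, Tuple


def mean_query_time(
    rows: Iterable[Tuple[str, float, float]]
) -> Dict[str, float]:
    """Map each query string to its mean processing time in seconds.

    rows holds (query, start, finish) values from the stat table.
    """
    totals: Dict[str, float] = defaultdict(float)
    counts: Dict[str, int] = defaultdict(int)
    for query, start, finish in rows:
        totals[query] += finish - start
        counts[query] += 1
    return {query: totals[query] / counts[query] for query in totals}
```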